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1 Introduction 

The Poisson distribution is the standard model for the analysis of count data. How- 
ever, in many situations this type of observations exhibit a substantially larger pro- 
portion of zeros than what is expected for the Poisson model (see Gupta et al. 
(1996)). For instance, this is often the case with count data coming from medical 
and pubhc health research (see Bohning et al. (1999) and Campbell et al. (1991)). 
This phenomenon usually arises when the distribution generating the data is a mix- 
ture of two populations, the first of which yields Poisson-distributed counts whereas 
the second one always contributes with a zero. 

One natural model to describe the above situation is the so-called zero-inflated 
Poisson (ZIP) model. We say that the random variable Y{6,p) has a ZIP distribution 
with parameters 9 and p {9 > and < p < 1) if 



Therefore, Y{9,p) is a mixture of a degenerate-at-zero distribution (with weight p) 
and a Poisson distribution of mean 9 /{I — p) (with weight 1 — p). In particular, 
Y{9, 0) is the classical Poisson variable with mean 9. The ZIP distribution has been 
used in diverse areas such as medicine (Bohning et al. (1992, 1999) and van den 
Broek (1995)) or biology (Nie et al. (2006)), among others. 

The expected value of the ZIP distribution is E(y(^,p)) = 9 and the variance 
Var(y(^,p)) — 9 -\- 9'^p/{l — p) increases as p increases. The zeros coming from the 
degenerate variable are called structural zeros and those from the Poisson model 
sampling zeros. It should be observed at this point that, to keep the mean fixed for 
different values of p, we do not follow the usual notation for the ZIP models. 

If the proportion of atypical zero observations remains undetected, the variabi- 
lity of the population is underestimated and the properties of standard inference 
techniques are, to some extent, deteriorated. For this reason, in the recent literature 
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there are different proposals to determine whether the Poisson model fits a data set 
well enough or, alternatively, we should choose a ZIP model that allows for an extra 
proportion of zero counts. A clear and concise review of several of these tests can be 
found in Xie et al. (2001). A popular and simple choice with good properties is the 
score test proposed by van den Broek (1995). 

Of course, as pointed out by El-Shaarawi (1985) and Thas and Rayner (2005), the 
rejection of the Poisson model does not imply that the ZIP distribution is the most 
appropriate model to fit the data. It may happen that an alternative model that 
accounts for the observed dispersion could fit the data better. The negative bino- 
mial and the zero-infiated negative binomial distributions are examples of reasonable 
alternatives. 

In this work we introduce a new procedure to detect zero-infiation and overdis- 
persion. The key idea is to link the notion of overdispersion with the concept of 
variability stochastic order. These orders arrange distributions according to their 
variability (see Section 3 of Shaked and Shanthikumar (2006)). Therefore, it is na- 
tural to suppose that the observed overdispersion is due to the data actually coming 
from a different model that dominates the initially assumed distribution in a vari- 
ability order. The most important variability order is the so-called convex order. We 
use the properties of this order to derive suitable discrepancy measures for tests in 
which "overdispersion" is understood as "convex domination" . 

The method we propose is fiexible and easy to implement. It is based on the 
empirical comparison of the expected sample extremes of two ordered models. An 
important feature is that the main ideas can be readily adapted to cover several 
different testing problems: tests for the proportion of structural zeros in zero-infiated 
models; procedures for testing if a parametric model is appropriate against another 
one with more variability; and a new general test to detect overdispersion. We 
illustrate in detail the apphcation of the methodology to the case of the ZIP models, 
but the technique can be analogously applied in other situations. 

The definitions and relevant results on stochastic convex dominance are briefly 
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reviewed in Section [2j These results supply the necessary theoretical background 
for the rest of the paper. In Section [3} we provide a general framework to detect 
overdispersion in ZIP models, but we note that the proposed method is very general 
and can be adapted to many other similar scenarios. We find discrepancy measures 
for tests on the proportion of structural zeros and discuss whether the Poisson model 
is appropriate or we should opt for a different model with more dispersion. In Section 
|4]we establish the relationships, in terms of the convex order, for some zero-inflated 
models usually considered in the literature: the zero-inflated binomial, Poisson and 
negative binomial model. These results allow to extend the previous ideas to these 
important discrete models. Section |5] analyzes the performance of the proposed 
tests via some Monte Carlo studies. Our proposals are very competitive against the 
well-known score test in the cases in which the latter can be applied. In Section 
[6| we analyze the fetal lamb data from Leroux and Puterman (1992) using our new 
procedures. For this data set we conclude that the ZIP distribution should be rejected 
against other models with more variability. This result is consistent with the previous 
work by Thas and Rayner (2005). Moreover, we show that the negative binomial 
model cannot be rejected. Finally, the proofs of the main results are collected in the 
appendix. 

2 The convex order and overdispersion 

In this section, we link the overdispersion phenomenon described in the introduction 
with the convex stochastic order. Given two integrable random variables X and Y, it 
is said that X is less or equal to Y in the convex order, and we denote it by X <cx Y, if 
E{(j){X)) < E(0(y)) for every convex function (p for which the previous expectations 
are well defined. Notice that, by considering the convex functions 0(a;) = ±x, the 
condition X <cx Y implies that EX = EY. Furthermore, if the variables have finite 
second moment, applying the definition of the convex order with 0(x) = (x — EX)^, 
we conclude that Var(X) < Var(y). Of course, establishing the relation X <cx Y is 
much more informative than just knowing Var(X) < Var(y). 
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Roughly speaking, since convex functions take larger values when its argument 
is large, if X <cx Y holds, then Y is more likely to take "extreme values" than 
X. This idea is clear from the following proposition. The result is a consequence of 
Corollary 4. A. 16 and Theorem 4. A. 50 in Shaked and Shanthikumar (2006), regarding 
the expected value of the extreme order statistics of two ordered variables. For A; > 1, 
if (Xi, . . . , Xk) is a random sample of size k from X, we denote by Xi-k the i-th order 
statistic of the sample, -i = 1, . . . , /c. Therefore, Xi-k and Xk-k stand for the minimum 
and maximum of the sample. 

Proposition 1. Let X and Y he integrahle random variables such that X <cx Y . 

(a) For allk>l, EYi,k < EXi,k and EXk-.k < EYk-.k- 

(b) If for some k > 2 EXi^^ = EYi-k or EXk-.k = '^Yk-.k, then X and Y have the 
same distribution. 

For instance, for the ZIP variables defined as in Q, we can prove (see Section [?]) 
that 

^-(^^,^1) <cx ^-(^^,^2), 0<pi<p2<l, 9>0. (2) 

Hence, Proposition [T] jointly with ^ imply that the ZIP variable Y(9,p2) is expected 
to take strictly larger extreme values than Y{9,pi) whenever pi < p2. 

3 Tests for overdispersion in ZIP models 

In this section we exploit Proposition [T] to derive discrepancy measures useful to test 
for overdispersion in ZIP models. We emphasize that the same technique, with the 
obvious modifications, can be applied in a similar way for the zero-inflated binomial 
and negative binomial models (see Section |4]) or, in general, for any pair of ordered 
distributions. 

The discrepancies introduced in this section are defined in terms of the empiri- 
cal counterparts of the expected extreme order statistics. Therefore, our goal is to 



5 



detect (significant) differences between the estimates of tlie expected extremes of two 
distributions. 



Actually, we deal with two different problems. In Subsection 3A_ we propose 
statistical tests to analyze the proportion of structural zeros in ZIP models. In 
other situations, we may want to check if the ZIP model cannot account for the 
dispersion of the data. Then it is adequate to apply the nonparametric procedure of 



Subsection 13.21 



3.1 Tests for the proportion of structural zeros 

Given a random sample Yi, . . . ,Yn from a variable Y{6,p) with the ZIP distribution 
Q, we are interested in testing Hq : p < Po against Hi : p > Po, where po is 
fixed and belongs to [0, 1) (the left unilateral and bilateral tests may be studied by 
similar arguments). There are several works in the literature devoted to this testing 
problem with po = (see e.g. van den Broek (1995), Xie et al. (2001), Jansakul 
and Hinde (2002) and He et al. (2003)). This particular case is important since 
it is equivalent to testing the Poisson model against a ZIP model with a positive 
proportion of structural zeros. However, as far as we know, there are no references 
in the literature including tests for values of ^ (0, 1). 

The method we propose is based on the following simple idea: (|2]) states that 
Y{6^Pi) <cx y{0^P2) whenever < pi < p2 and hence according to Proposition [l| 
the variable Y{6,p) is expected to take strictly larger extreme values under Hi than 
under Hq. Using the information in Yi, . . . ,Yn, we can estimate the expectation of 
the maximum (or minimum) in a generic subsample of size k > 2 from Y{6,p) and 
Y{d,pQ). Then, we reject Hq whenever the difference between the two estimates is 
too large. 

More precisely, we denote by Fjg^p{Ykj:) and Eg^piYi.k) the expected values of the 
maximum and minimum of k independent copies of Y{6,p), respectively. Given the 
random sample Yi, . . . ,Yn from Y{6,p), the maximum likelihood estimates of the 
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parameters 9 and p in the ZIP model satisfy (see Johnson et al. 2005) 

ly-y: 1-no/n 



e = Y 



and p = 1 



(3) 



l-expM/(l-p)) 

where no is the number of zero-counts in the sample. Then, for k > 2, we compute 
the discrepancy measures: 



Ak:k = E. - ^ipoO^k-.k) and Ai., = E,(Y^.,k) - E,JY^., 



(4) 



and reject Hq either if Ak-j. or Ai-^ is too large. Observe that, from the equalities 
E{Y{e,po)) = E{Y{e,p)) and E(Xi:2) + E(X2;2) = 2E(X) (which holds for any 
integrable random variable X), it is readily checked that A2:2 = Ai:2. 

If we denote by Fg^p the distribution function of y(6',p), the discrepancies in ^ 
can be rewritten as: 



A.^ 



k:k 



Ai:A 



oo 

E 

i=0 



(5) 



In practice, we can always truncate the above series to approximate their value. 

To obtain the rejection region of the tests we need to find the distribution of Ak-.k 
or Ai:k for > 2 under Hq. In Theorem [T] we obtain the asymptotic distribution 
of A2:2 when pq = 0. However, in general, the distribution of these quantities is 
rather involved and a simple parametric bootstrap schema can be used instead. The 
following procedure is described for the discrepancy Ak-.k but the corresponding one 
for Ai;k is analogous: 

(a) Find the estimate 9 = Y. 

(b) Extract B parametric bootstrap samples of size n, Yj*^, . . . , Y*f^, for 6 = 1, . . . , 5, 
from the distribution of Y{9,pq). 



(c) For each sample Y*f^, . . . , 1^*^, obtain the estimates 9^ and pl using (3). 
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(d) Compute the discrepancies A^;^ = Eg,^p,{Yk;k) - Eg._p^^(Ffc:fe), b = l,...,B. 

(e) For a significance level a, find Ql.^ia), the (1 — a;)-quantile of the values 
{Alib = l,...,B}. 

The rejection region for the test Rq : p < po versus Ri : p > po, at significance level 
a, is approximated by 

= {Ak-.k > QIM}. (6) 
As it was mentioned before, the case po = corresponds to testing the Poisson 



model against a ZIP model with p > 0. The simulation studies in Subsection 5.2 



show that A2:2 has a good behavior. The use of A2:2 means that we compare what 
we expect to obtain for the maximum (or minimum) of two independent Poisson 
variables with that of two ZIP variables with p > 0. In this case, there is a closed- 
form expression for £90(^2:2) (see Johnson et al. (2005), p. 166): 

M2(^) := E,,o(r2;2) = e + ee-^' iIo{29) + hi29)) , (7) 

where Iq and Ii are modified Bessel functions of the first kind (see e.g. Abramowitz 
and Stegun (1965)). Using Q we can rewrite the discrepancy A2:2 given in (g with 
Po = as 

A2:2 = 2pe + {i-pfM2{e/{i-p)) - M2{e). (8) 

This enables us to obtain the asymptotic distribution of A2:2 under Hq : p = 
(Poissonness). In the following theorem the symbol " — stands for "convergence 
in distribution" and A^(0, 1) is a standard normal variable. 

Theorem 1. Under Hq : p = 0, it holds that 

a{e) 

where 

§2(i_ e-2e ' ^ ^ ' 



l + ^)/o(2^)-/i(2^) + 0/2(2^) 
AO) ■■= — ^ ^, (9) 

and Iq, Ii and I2 are modified Bessel functions of the first kind. 
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As an immediate consequence of Theorem [T| a critical region with asymptotic 
significance level a for Hq : p = against Hi : p > is 

Ra= \Vn^^ > Za\, (10) 



with Za being the (1 — a)-quantile of the standard normal distribution. We remark 
that this test is very simple and easy to implement since the Bessel functions ap- 
pearing in A2:2 and a (9) can be evaluated by any standard mathematical software 
package. 

3.2 A general test to detect overdispersion 

Here, we deal with the problem of detecting if a data set comes from a Poisson 
distribution or there is dispersion that the Poisson model cannot take into account. 
The same procedure works for the more general ZIP model or the distributions 
considered in Section |4| but we illustrate the ideas with the Poisson distribution for 
the sake of simplicity. 

Let us consider the family V := {Y{6) : 9 > 0}, where Y{9) is a Poisson variable 
with mean 9. We denote by Pcx the set of all integrable random variables, not 
having the Poisson distribution, that dominate in the convex order a variable in V. 
Therefore, Vex includes distributions with strictly more dispersion than the Poisson 
variables. In particular, according to (|2| and Proposition |3] in Section |4| all the ZIP 
(with p > 0) and the (zero-inflated) negative binomial distributions are included in 
Vex- Given a random sample Yi, . . . ,Yn from Y, we want to test Hq : Y E V against 
Hi : r G Pex. 

In this new test the alternative hypothesis is not completely specified in the sense 
that it is not given by a parametric family. However, to handle this problem we can 



use similar ideas to those in Subsection 3.1 We first estimate the parameter 9, 9 = Y. 



Then, we compute the expectation of the maximum or minimum of k independent 
copies of Y{9), Eg(Yfc.fc) and Eg(Yi,k), as before in Subsection 



3.1 



On the other hand, 



since there is no parametric restriction under Hi, we estimate EY^-fc and EYi-^ by 
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means of the following nonparametric plug-in estimators: 

k / ■ 1 \ fc" 



i=l 



1 



i=l 



1 



n 



1 



y. 

i:ni 



where is the empirical distribution function of the sample Yi, . . . , Hence, for 
/c > 2, we consider the discrepancies 



Afe:fc := ^F^^k-.k) - ^e^k:k) and Ai.^ := Eg(Fi:fe) - ^F,,ijv.k 



Under Hq these discrepancies are close to whereas, if Hi holds, then Ai-^. and 
are (strictly) positive for n large enough. Therefore, we reject Hq whenever Ai:^ or 
Afc:fe are too large. The rejection region of these tests can be derived by using a 



parametric bootstrap approach similar to the one described in Subsection |3.1[ 

We finally note that we actually have a different test for each discrepancy. The 
power of the test may depend on the selection of the statistic. The choice of a test 



with good power is addressed in Subsection 5.1 



4 Extensions to other models 

The application of the methodology described in the previous section relies on ver- 
ifying the convex domination of the involved variables. In this section, we establish 
all the relationships, according to the convex order, among the zero-inflated versions 
of some commonly used models for count data: the Poisson, the binomial and the 
negative binomial models. For these important discrete models, these relationships 
allow to extend straightaway the ideas developed in the previous section. 

We first note that, given a data set, it is sensible to assume that the models 
that could fit the data have the same mean. Hence, all the parametric distributions 
considered in this section are selected to have the same expectation Q. 

For m>l,0<]9<l and < < m(l — p), let us consider the random variable 
X{m,6,p) which is the mixture between the degenerate-at-zero variable with weight 
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p and a binomial variable of parameters m and 6/[m{l — p)] with weight 1 — p. 
In other words, X{'m,9,p) has the zero-inflated binomial (ZIB) distribution with 
probabilities 



Fr{X{m,e,p) = k) 




m(l—p) 



if = 

m—k 



if 1 < < m. 



Furthermore, we also consider the variable Z{t, 6) with negative binomial (NB) 
distribution of parameters 1/t and t6 {t > and 6 > 0), i.e.. 

Among the different parametrizations of the NB distribution, we have chosen the 
unique one, Z{t, 6), with mean 6 (for all t) and increasing in t for the convex order, 
that is, satisfying Z(ti,9) <cx Z(t2,9) whenever < ti < ^2 (see Proposition [s] (d) 
below) . 

However, there are infinitely many possibilities to inflate with zeros the vari- 
able Z{t, 6) preserving the mean 6. Among them, we only consider the most rep- 
resentative two. On the one hand, for t, > and < p < 1, let Zi{t,6,p) be 
the mixture between the degenerate-at-zero variable with weight p and the variable 
Z{t{l —p), d/{l —p)) with weight 1 —p. On the other hand, for t,6 > and < p < 1 
let Z2{t,6,p) be the mixture between the degenerate-at-zero variable with weight p 
and the variable Z{t, 6 /{I — p)) with weight 1 — p. We refer to these two models as 
the zero-inflated negative binomial (ZINB) models. 

In order to clarify the notation. Table [T] summarizes the relevant information 
about the models considered throughout this section. We note that all the variables 
have a fixed mean 6 and a proportion p of structural zeros. 

The variance of all the zero-inflated variables described before is an increasing 
function of p G [0, 1). Actually, the next proposition shows that they are convexly 
ordered for different values of p. 
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Table 1: Summary of i]ic cousidc^rcxi modc^ls. 



Model 


Notation 




Variance 


ZIB 


X{m,9,p) 


e + 


e'^p e'^ 


1 — p m(l — p) 


ZIP 


Y{9,p) 


e + 


i-p 


ZINB(l) 


Zi{t,e,p) 


e + 


1 — p 


ZINB(2) 


Z2{t,e,p) 


9 + 


e^p OH 



1 — p 1 — p 



Proposition 2. Let X{m,6,p), Y{6,p) and Zi{t,6,p) (i = 1,2) be variables with 
the ZIB, ZIP and ZINB distributions described above. // < pi < p2 < 1, then 

(a) X(m, 6',pi) <cx X{m, 9,p2), for all m > 1 and < 9 < m(l — P2). 

(b) Y(e,pi) <ex Y{e,p2), for all 6 > 0. 

(c) Zi{t, e,pi) <cx Zi{t, e,p2), for allt > 0, e > Q andi = 1, 2. 

The limiting distribution of X{m,9,p) (as m t 00) and of Zi{t,9,p) (as t I 0) 
for i = 1, 2 is the ZIP variable Y(9,p). The smaller m is, the more the ZIB variable 
differs from the ZIP one. Also, the larger t is, the more the ZINB variables differ 

from the ZIP one. 

For a fixed proportion of structural zeros, the next proposition presents the rela- 
tionships among these four discrete models. 

Proposition 3. For a fixed p e [0, 1), we have: 

(a) X{m, 9,p) <cx X{m + 1, 9,p), for all m > 1 and < 9 < m(l — p). 

(b) X{m,9,p) <cx Y{9,p), for all m > 1 and < 9 < m(l -p). 
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(c) Y{9,p) <cx Zi{t,e,p) <cx Z2{t,e,p), for all e > andt > 0. 

(d) Zi{ti,e,p) <cx Zi{t2,e,p), for alio < h < t2, > and i = 1,2. 

Proposition [2] allows to test on the proportion of structural zeros in all the mod- 
els of this section. Further, Proposition [3] makes possible the comparison of these 



parametric families. The nonparametric tests described in Subsection 3.2 can also 
be adapted to these models. An example of the application of these tests can be 
found in Section [6l 

5 Simulations 

We have carried out a Monte Carlo study to check the performance of the tests 
described above. The simulations also give insight into the choice of the suitable test 
statistic. The significance level in all cases is fixed as a = 0.05. 

5.1 The choice of the discrepancy measure 

The approach discussed in Section |3] generates a family of discrepancies for the ad- 
dressed testing problems. We actually have a different test if we select the maximum 
or minimum in the discrepancy: A^-.k or Ai^^ in the tests of Subsection |3.1| and Ak-.k 



or Ai:fc in the nonparametric case of Subsection 3^ Moreover, the test statistics also 
differ for each k >2. Hence, the question of finding a test with good power arises. 

Regarding the tests on the proportion of structural zeros discussed in Subsection 
3.1[ observe that both hypotheses assume that the observations follow a parametric 
(ZIP) distribution. The tests mainly rely on the estimation of the parameters of 
the model, and the choice of the discrepancy is of secondary importance. Some 
preliminary simulations showed that different discrepancies and values of k yield 
similar powers. Therefore, in this situation we opt for the simplest one A2:2 = Ai:2 
defined in dsj), which has computational advantages over the others with larger fc's. 



We now turn to the test for overdispersion of Subsection 3.2 Hq is given by a 



parametric model whereas Hi includes all the distributions that strictly dominate an 
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element of the initial family. Hence, Hi is not specified by any parametric family. In 
this case, the power of the tests strongly depends both on the distribution generating 
the data and on the parametric family assumed in Hq. For a fixed discrepancy, dif- 
ferent alternatives could lead to very different powers. Therefore, it is advantageous 
to have a family of discrepancies since this provides flexibility to select a good test 
in each situation. 

Let us briefly explain how the coefficient of variation (CV) of the discrepancy is 
useful to choose a test with good properties. Under Hi, an adequate discrepancy to 
detect deviations from Hq should have a large mean and low variance, that is, a low 
CV. The CV of the discrepancy describes well how the corresponding test behaves. 
In general, under Hi, a low CV is paralleled by a high power. This is clearly reflected 
in Figure [T| where, for 1000 Monte Carlo samples, we plot the power of the test for 
over dispersion for the Poisson family and the inverse of the CV of the discrepancy 
Ai-k defined in (11), for different values of k. In Figure [T]^a), the observations are 



generated from a ZIP distribution y(3,0.05), while in Figure [T]^b) they are drawn 
from the NB distribution Z(0.05,3). In the first value of k around 20 is a 

good choice, but in the second case = 2 is clearly the best one. Therefore we use 
these two values of k in the simulations of Subsection 15.31 

We finally note that when analyzing only one data set, it also becomes possible 
to choose a suitable discrepancy by estimating its CV via bootstrap (see Section |6] 
for details). 

5.2 Simulations for the test on the proportion of structural zeros 

We consider the test on the proportion of structural zeros in a ZIP model (Subsec- 



tion 3.1). As argued in Subsection 5.1, we select k = 2. For the case Po = (Hq 
represents the Poisson distribution), we compare the performance of the score test 
(van den Broek 1995) and the test methodology that rejects Hq if the discrepancy 
Ai.2 = A2:2 in ^ is too large. The rejection region for the latter method is chosen 
in two ways: via bootstrap as in ^ and also using the asymptotic distribution of A2:2 
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as in (10). The number of bootstrap samples is B = 5000. 

In Table |2] we record the proportion of times that Hq : p = is rejected. For each 
combination of p and 6 in the table, we generate 5000 Monte Carlo samples of sizes 
n = 50, 100 and 200 from Y{6,p). Note that our proposed procedure has a very 
competitive performance in comparison to the score test. This is specially apparent 
for the lowest values of 6, where, when p > 0, in general our procedure yields a higher 
power than the score test. 

In Table |3] the results for the test Hq : p < 0.2 against Hi : p > 0.2 are displayed. 
In this case we only use the procedure based on A2:2 with rejection region The 
number of Monte Carlo samples is again 5000. 

5.3 Simulations for the overdispersion test 

We test Ho : Y ^ V (V being the Poisson family) against Hi : F G Vex following 



the procedure described in Subsection 3.2 The number of Monte Carlo samples is 



5000 and the number of bootstrap samples used to compute the rejection region is 
B = 5000. We generate observations with sample sizes n = 50, 100 and 200, from 
a ZIP distribution Y{9,p) and apply the nonparametric procedure based on Ai:2o- 
Afterwards, we generate samples from the NB distribution Z(t,6) and carry out 
the test with A1.2. Recall that the justification for selecting such discrepancies was 
detailed in Subsection 5A_ In Tables |4] and |5] we display the proportion of times that 



Ho is rejected. Observe how close the powers in Table |4] are to those of Table [2j 
We found this property appealing since in this test for overdispersion no parametric 
model is specified for the alternative hypothesis. 



6 An example with real data 

To illustrate the usefulness of the methods proposed throughout the paper, we ana- 
lyze a data set from Leroux and Puterman (1992). The number of movements by a 
fetal lamb observed through ultrasound were recorded. We consider one particular 
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sequence of counts of the number of movements in each of 240 consecutive 5-second 
intervals (see Table [6]). 

If we assume that the data follow a Poisson distribution with mean 9, the estimate 
of 6' is 6' = 0.36. The differences between the observed and the expected frequencies 
in Table |6] point out that the Poisson model is unsuitable. Douglas (1994) used the 
Pearson statistics to argue that a ZIP model provides a substantially improved 
fit. The estimates of the parameters under the ZIP model are 9 = 0.36 and p = 0.58. 
The corresponding expected frequencies are in the fourth row of Table [6j The fit 
seems indeed better, but we could formalize this statement by testing Hq : p = 



versus Hi : p > 0. We apply both the asymptotic test (10) and the score test. 
Both results point out a strong evidence (p- values below 0.0001) against the Poisson 
model. This leads us to the conclusion that the ZIP distribution fits the data much 
better than the Poisson one. 

Rejecting the Poisson model does not necessarily imply that the ZIP model pro- 
vides the best fit. Another model could account better for the observed dispersion. 



Therefore, using the nonparametric test developed in Subsection 3.2, we now test 
the null hypothesis that the distribution is ZIP against the alternative that the true 
model has more variability than the ZIP one. In this case, we have to select the 



appropriate statistics (Ai:^ or Ak-k) and a suitable value for k (see Subsection 5.1). 
For that purpose, we obtain bootstrap estimates (based on 500 bootstrap samples) of 
the inverse of the CV of Ai-k and A^.^, for different fc's. The estimates as a function 
of k are displayed in Figure |2} 

According to the results depicted in Figure |2} the test based on Ak-.k is preferable. 
Moreover, for A^.^, there is a wide range of k values (between 50 and 200, say) for 
which the results are fairly similar. For the tests based on A^.^ with k = 50, 90, 130 
the p- values are under 0.0005. We conclude that the ZIP model is also clearly rejected 
so that other distributions with higher dispersion are more appropriate to fit this data 
set. Other authors have reached the same conclusion by rather different approaches. 
For instance, Ridout et al. (2001) reject the ZIP against the ZINB using a score 
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test in the spirit of van den Broek (1995). Thas and Rayner (2005) reject the ZIP 
against general smooth ahernatives in the sense of Neyman. A generahzed Poisson 
distribution to fit this data set has also been proposed by Gupta et al. (1996). 

A simpler alternative to model this data is the NB distribution. The estimated 
parameters are 9 = 0.36 and t = 1.89, and the corresponding expected frequencies 
can be found in the fifth row of Table [6j At first sight it seems the fit provided by 
the NB is slightly better than the one furnished by the ZIP. To confirm this feature, 



we adapt the nonparametric procedure described in Subsection |3.2| to test the null 
hypothesis that the data come from a NB distribution against the alternative that 
the data come from a distribution that dominates the NB in the convex order. 

We have used bootstrap estimates of the inverse of the CV to conclude that in this 
case Ak;k with A; ^ 8 yields an appropriate test (details are omitted). The p- values 
of the tests for k = 4,6, 8, 10, 12 are all above 0.33. Therefore, we cannot reject the 
null hypothesis and conclude that the NB distribution accounts for the dispersion of 
the data better than the ZIP model. 



7 Appendix: Proofs 

Proof of Proposition [2] 

We need to introduce some notation. Given two integrable random variables X 
and Y, it is said that X is smaller than Y in the increasing convex order, written 
X <icx Y, if E(0(X)) < E(0(y)), for all increasing and convex function 0, provided 
the expectations exist. It is easy to see that 

X <cx Y if and only if X <icx Y and EX = EY. (12) 

Therefore, since all the variables considered in Proposition [2] have the same expec- 
tation 6, if suffices to show that they are ordered for the increasing convex order. 
Moreover, since the proof of parts (a), (b) and (c) with i = 1 are similar, we only 
consider the case of ZIP variables (part (b) of the proposition). 
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We first note that the family V := {Y(9) : G [0, oo)} , where Y{9) is a Poisson 
random variable of mean 6* > (1^(0) = 0) is stochastically increasing and convex 
(see Example 8. A. 2 in Shaked and Shanthikumar (2006)). For < pi < p2 < 1, we 
define the random variables (independent of the variables in V) 0j = j:^i?(l — Pi) 
[i = 1,2), where B{1 — pi) is a Bernoulli variable of parameter 1 — pi. It is readily 
checked that Bi <cx 02- Therefore, a direct application of Theorem 8. A. 14 (p. 
362) in Shaked and Shanthikumar (2006) yields P{Qi) <icx -^(©2), and taking into 



account (12), we conclude P{&i) <cx -P(02)- Therefore, the proof of part (b) is 
finished since the ZIP variable Y{6,pi) has the same distribution as P(0i) {i = 1, 2). 

The previous argument, based on the properties of stochastically increasing and 
convex families, cannot be used to prove part (c) with i = 2 since it has not been 
established yet whether the collection of negative binomial variables is stochastically 
increasing and convex in its second parameter. We therefore need to introduce 
another technique inspired in the ideas used to prove Lemma 10 in de la Cal and 
Carcamo (2005). Fix t > 0, 6 > and < pi < < 1 and let Z2(t, e,pi) {i = 1, 2) 
be the ZINB distributions defined in Section |4j Taking into account Lemma 9 in de la 
Cal and Carcamo (2005) and Theorem 3. A. 44 (p. 133) in Shaked and Shanthikumar 
(2006), to prove part (c) (with z = 2) it is enough to show that the function 

p{k) := Pr(Z2(t,^,pi) = k)- PT{Z2{t,e,p2) = k), k> 0, (13) 

has two changes of sign, being the sign sequence —,+,—. To show this, we first 
consider the function 

_Pr(Z2MM^ 
^^^^-Pr(Z2(t,^,P2) = fc)' 

After some simple computations, it is easy to check that the function f{p) : = 
PY{Z2{t,9,p) = 0) is an increasing function of p G [0,1). Therefore, ^^(O) < 1. 
Also, since 

ip{k + l) l-p2 + 9t 



ip{k) l-pi + et 



=: c < 1, /c > 1, 
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we have that f{k) = ^^{^) {k > 1) and this entails (p{k) | as 1 < A; t oo. 
Moreover, the equahty Xlfclo P^(^2(t, 6*,^!) = k) = 1 = Pr(Z2(t, 6',p2) = k) 

yields > 1. This implies the desired result and the proof is complete. 

Proof of Proposition [3] 

In the case p = 0, parts (a)-(d) follow from Lemmas 5 and 10 in de la Cal and 
Carcamo (2005) and Theorem 3. A. 44 (p. 133) in Shaked and Shanthikumar (2006). 
Therefore, using that the convex order is closed under mixtures (see Theorem 3. A. 12 
(p. 119) of Shaked and Shanthikumar (2006)), we conclude that for any fixed < 
p < 1, (a)-(c) and the first stochastic inequality in (d) are valid. To finish, we observe 
that the distribution of Zi(t, 9,p) is the same as the distribution of Z2(t{l — p), 9,p) 
and applying part (c) of Proposition [3| we get Z2{t{l — p),6,p) <cx ^2(1, 9, p). This 
shows that Zi(t, 9,p) <cx Z2(t, 9,p) and the proof is complete. 

Proof of Theorem 1 

We first note that the discrepancy A2:2 = ^2:2(9, p) given in ([8| is a smooth function 
of the maximum likelihood estimates, 9 and p. Therefore, the desired asymptotic 
distribution can be obtained combining the classical asymptotic theory for maximum 
likelihood estimators and the delta method. 

According to the the asymptotic theory for maximum likelihood estimators, we 
have that: 

V^{9-9,p-py ^dNi{0,Oy,J:), n^oo, 

where A^((0,0)*,S) is a bivariate normal distribution centered at the origin with 
covariance matrix S. The matrix S is the inverse of the expected Fisher information 
matrix, that is, = —E0^p[i"{Y; 9,p)], where i"{y; 9,p) is the 2x2 matrix of second 
partial derivatives with respect to 9 and p of the log-likelihood function i{y;9,p). 
Using this result, after some algebra it is possible to show that, under Hq : p = 0, 

V^i9 - 9,py -^d N((0, 0)*, So), 00, 
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where 

' 9 
(e^ - 1 

Now, let VA2:2{9,p) be the gradient of A2:2{9,p) evaluated at {9,p). Using the delta 
method (see e.g. van der Vaart (1998), Theorem 3.1., p. 26) we deduce that, under 
Ho : P = 0, 

v^A2:2 -^d iV(0, cr''{9)), n^oo, (14) 
where a'^{9) := VA2:2{9, 0)* ■ Sq • VA2:2(6', 0). Now we observe that 

^^^dM =29- M2{9) - 9'e-'%h{29) - 

where the function M2 is defined in ([7]). To obtain the last equality above we use the 
following properties of the modified Bessel functions of the first kind: I'oix) = Ii{x) 
and I[{x) = [Io{x) + l2{x)]/2 (see Abramowitz and Stegun (1965), properties 9.6.27 
and 9.6.29, p. 376). Replacing these partial derivatives and the matrix Sq in the 
expression VA2:2(6', 0)* ■ So ■ VA2:2(6', 0) yields 

{29 - M2{9) - 9^e~^'[Io{29) - l2{29)])^ 



dA2:2{9,p) 



09 



0, 



p=0 



a\9) 



1-9 



9' (1 - [(1 + 9)Ioi29) - h{29) + 0/2(2^)])' 



Finally, it is obvious that a {9) defined in is a consistent estimator of the stan- 



dard deviation (j(9). As a consequence, from (14) we also deduce that the conclusion 
of Theorem [T] holds. 
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Table 2: Proportion of times that Hq : p = was rejected. 



P 



n 


d 





0.05 


0.1 




50 


3 


0.047 
0.056 
0.036 


0.386 
0.422 
0.313 


0.784 
0.800 
0.722 


Bootstrap 

Asymptotic 

Score 


50 


5 


0.041 

0.055 
0.044 


0.768 

0.794 
0.779 


0.972 

0.981 
0.978 


Bootstrap 
Asymptotic 

Score 


50 


10 


0.002 
0.002 
0.002 


0.923 
0.923 
0.923 


0.994 
0.994 
0.994 


Bootstrap 

Asymptotic 

Score 


100 


3 


0.052 
0.059 
0.049 


0.585 
0.604 
0.494 


0.964 
0.963 
0.943 


Bootstrap 

Asymptotic 

Score 


100 


5 


0.043 
0.077 

0.045 


0.944 
0.966 

0.945 


0.999 
1.000 

0.999 


Bootstrap 
Asymptotic 

Score 


100 


10 


0.003 
0.003 
0.003 


0.994 
0.994 
0.994 


1.000 
1.000 
1.000 


Bootstrap 

Asymptotic 

Score 


200 


3 


0.051 

0.054 
0.048 


0.827 

0.831 
0.762 


0.999 

0.999 
0.999 


Bootstrap 

Asymptotic 
Score 


200 


5 


0.050 
0.065 
0.043 


0.999 
0.999 
0.999 


1.000 
1.000 
1.000 


Bootstrap 

Asymptotic 

Score 



0.007 1.000 1.000 Bootstrap 
200 10 0.007 1.000 1.000 Asymptotic 
0.007 1.000 1.000 Score 
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Table 3: Proportion of times that Hq : p < 0.2 was rejected. 



P 



n 


e 


0.2 


0.25 


0.3 


50 


3 


0.069 


0.272 


0.581 


50 


5 


0.067 


0.253 


0.554 


50 


10 


0.061 


0.244 


0.546 


100 


3 


0.062 


0.366 


0.781 


100 


5 


0.065 


0.346 


0.780 


100 


10 


0.061 


0.370 


0.774 


200 


3 


0.061 


0.536 


0.953 


200 


5 


0.065 


0.557 


0.962 


200 


10 


0.069 


0.584 


0.963 



Table 4: Proportion of rejections of Hq : Y ^ V when 
sampling from a ZIP Y{9,p). 



n 


e 




P 







0.05 


0.1 


50 


3 


0.043 


0.358 


0.780 


50 


5 


0.045 


0.732 


0.966 


50 


10 


0.051 


0.901 


0.993 


100 


3 


0.047 


0.576 


0.952 


100 


5 


0.052 


0.911 


0.999 


100 


10 


0.053 


0.982 


1.000 


200 


3 


0.054 


0.839 


0.999 


200 


5 


0.054 


0.992 


1.000 


200 


10 


0.050 


0.999 


1.000 
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Table 5: Proportion of rejections of Hq : Y E V when 
sampling from a NB Z{t,9). 



t 



n 


e 


0.05 


0.1 


50 


3 


0.183 


0.385 


50 


5 


0.309 


0.635 


100 


3 


0.269 


0.583 


100 


5 


0.479 


0.871 


200 


3 


0.409 


0.812 


200 


5 


0.710 


0.989 



Table 6: Lamb data set and expected frequencies based on 
Poisson, ZIP and NB distributions. 



Outcome 





1 


2 


3 


4 


5 


6 


7 


Obs. Freq. 


182 


41 


12 


2 


2 








1 


Expect. Freq. (Poisson) 


167.7 


60.1 


10.8 


1.3 


0.1 


0.0 


0.0 


0.0 


Expect. Freq. (ZIP) 


182.0 


36.9 


15.6 


4.4 


0.9 


0.2 


0.0 


0.0 


Expect. Freq. (NB) 


182.5 


39.0 


12.0 


4.1 


1.5 


0.5 


0.2 


0.1 



25 



Figure 1: Power (in black) of the overdispersion test for the 
Poisson family and 1/CV of the discrepancy (in grey). 
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Figure 2: Bootstrap estimates of CV for A/j:^ (solid line) 
and Ai-fc (dashed line) for several values of k. 
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